Data Exploration Gabriel Hamulic

Dataset: US Real Estate Dataset

Author

Hamulic, Gabriel

Published

November 9, 2025

Instructions

Your report must be of high quality, meaning that your report:

  • is visually and textually pleasing of
  • does not look/read/feel like a draft instead of a finished analysis
  • explains/discusses your findings and results in the main text, e.g., explain/discuss all figures/table in the main text
  • is representable such that it can show to any interested third party
  • uses figure/table captions/linking/reference (see example further down)
  • Do not show any standard printout of R-code, use for data.frame/tibbles knitr::kable() printing.
  • Do not simply print datasets (too many lines) use instead rmarkdown::paged_table()

Introduction

In this project, we explore a large dataset of over 2 million real estate listings across the United States. The data includes key property attributes such as price, size, lot area, location, and listing status.

Our goal is to gain a deeper understanding of the current U.S. housing market by identifying:

  • Patterns in property prices and sizes across states and cities

  • Relationships between housing characteristics (e.g., size, bedrooms, bathrooms)

  • Regional differences in affordability and market activity

Libraries

Libraries
library <- function(...) {suppressPackageStartupMessages(base::library(...))}
library(tidyverse)
library(dplyr)
library(knitr)
library(tidyr)
library(rmarkdown)
library(janitor)
library(scales)
library(tidytext)
library(ggforce)
library(GGally)
library(DT)
library(kableExtra)
library(broom)
library(plotly)

Data

Data source

In this data exploration we are looking at the US Real Estate market with use of a dataset from kaggle published by Ahmed Shahriar Sakib. It contains over 2.2 Million Real Estate listings broken down to State, Size, Price (among other factors). (Source: https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/data)

Data import

data = read.csv("data/realtor-data.zip.csv") # Data import

Data Transformation

data = subset(data, select = c(status, price, bed, bath, acre_lot, city, state, house_size)) # keep relevant columns

Structure with standard datatypes

Code
structure_tbl <- tibble::tibble(
  Variable = names(data),
  Type = sapply(data, function(x) class(x)[1]),
  Example = sapply(data, function(x) {
    val <- unique(x[!is.na(x)])[1]
    if (is.factor(val)) 
      as.character(val)
  }),
  Missing = sapply(data, function(x) sum(is.na(x)))
)

kable(
  structure_tbl,
  caption = "Structure summary of the dataset",
  align = c("l", "l", "l", "r")
)
Structure of cleaned dataset
Variable Type Example Missing
status character NULL 0
price numeric NULL 1541
bed integer NULL 481317
bath integer NULL 511771
acre_lot numeric NULL 325589
city character NULL 0
state character NULL 0
house_size numeric NULL 568484
Code
# Assign Data Types
data$status = as.factor(data$status)
data$city = as.factor(data$city)
data$state = as.factor(data$state)

NA Removal

Code
before_rows <- nrow(data)
data <- na.omit(data)
after_rows <- nrow(data)
kable(data.frame(Description = c("Before NA removal", "After NA removal"),
Rows = c(before_rows, after_rows)))
Description Rows
Before NA removal 2226382
After NA removal 1360716

The dataset now has 1360716 observations and 8 variables after removing rows with missing values.

Filtering

Code
# Filter min and max values
data = data |>
  filter(price > 10000 & price < 1000000000)

Calculations

Code
data = data |>
  mutate(price_per_sqm = price/house_size)

Cleaned Dataset

Code
paged_table(data)

Structure after transformation

Code
structure_tbl <- tibble::tibble(
  Variable = names(data),
  Type = sapply(data, function(x) class(x)[1]),
  Example = sapply(data, function(x) {
    val <- unique(x[!is.na(x)])[1]
    if (is.factor(val)) as.character(val) else as.character(round(val, 2))
  }),
  Missing = sapply(data, function(x) sum(is.na(x)))
)

kable(
  structure_tbl,
  caption = "Structure summary of the dataset",
  align = c("l", "l", "l", "r")
)
Structure of cleaned dataset
Variable Type Example Missing
status factor for_sale 0
price numeric 105000 0
bed integer 3 0
bath integer 2 0
acre_lot numeric 0.12 0
city factor Adjuntas 0
state factor Puerto Rico 0
house_size numeric 920 0
price_per_sqm numeric 114.13 0

After cleaning the dataset, all variables have appropriate data types and no missing values (n = 1360076):

  • status – Factor variable showing the listing status (e.g., for_sale, sold).

  • price – Numeric value for the property’s price in USD.

  • bed, bath – Integer counts of bedrooms and bathrooms.

  • acre_lot – Numeric size of the lot (in acres).

  • city, state – Factor variables identifying the property’s location.

  • house_size – Numeric size of the house (in square feet).

  • price_per_sqm – Numeric variable derived from price / house_size to compare prices across properties.

All rows with missing data were removed, and categorical variables were converted to factors for easier analysis and visualization later on.

Data dictionary

Code
tibble(
  Variable = c("price", "status", "acre_lot", "state", "house_size"),
  Description = c(
    "The price for which the item was listed on the market",
    "The status if the house is already sold or still for sale",
    "The size of the land / lot on which the house is located in acres",
    "The state in which the house is located",
    "The size of the house in square feet"
  )
) |>
  kable(
    caption = "Description of key variables in the dataset",
    align = c("l", "l")
  )
Description of key variables in the dataset
Variable Description
price The price for which the item was listed on the market
status The status if the house is already sold or still for sale
acre_lot The size of the land / lot on which the house is located in acres
state The state in which the house is located
house_size The size of the house in square feet

Summary statistic tables

In this section we will cover the summary of our cleaned dataset. We will explore basic statistical values from our data.

Numeric Statistics

Summary of numerical values

Code
data |> 
  janitor::clean_names() |>
  mutate(row = row_number() |> factor()) |> 
  pivot_longer(cols = where(is.numeric)) |> 
  group_by(name) |> 
  summarize(N = n(),
            min = min(value),
            mean = mean(value),
            median = median(value),
            max = max(value),
            st.dev = sd(value)
            ) |> 
  knitr::kable(digits = 2)
Summary statistics of numerical variables in dataframe
name N min mean median max st.dev
acre_lot 1360076 0.00 12.75 0.21 100000.0 799.83
bath 1360076 1.00 2.54 2.00 222.0 1.36
bed 1360076 1.00 3.39 3.00 444.0 1.43
house_size 1360076 100.00 2119.39 1812.00 1560780.0 4069.66
price 1360076 10400.00 573730.62 379000.00 515000000.0 1226596.98
price_per_sqm 1360076 0.02 262.42 197.42 491412.2 500.86

Interpretation

price: Very wide range ($10.4k–$51.5M). Mean ($573k), median ($379k), indicating strong right-skew and high-priced outliers.

house_size: Average ~2,119 sqft, median 1,812 sqft. Extremely large max (1,560,780 sqft) signals outliers. The distribution is right-skewed.

acre_lot: Median 0.21 acres vs. mean 12.75 acres → a few very large parcels inflate the mean.

bed / bath: Typical homes (~3 beds, 2 baths) with modest spread; minima at 1 suggest realistic counts.

price_per_sqm: Mean $262.42 vs. median $197.42, also right-skewed, consistent with price outliers.

Visualisation of numerical values

Code
data |>
  clean_names() |>
  pivot_longer(cols = where(is.numeric)) |>
  ggplot(aes(x = value, fill = name)) +
  geom_histogram(bins = 30, alpha = 0.7, color = "white") +
  scale_x_log10(labels = label_comma()) +   # 👈 echte Werte, log-Skala
  facet_wrap(~ name, scales = "free_x") +
  theme_minimal() +
  labs(
    title = "Distribution of Numerical Variables (logarithmic scale)",
    x = "Value",
    y = "Count"
  ) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 25, hjust = 1) 
    )

Nominal Statistics

Summary of nominal variables (top categories)

Code
top_n_per_var <- 10 

nominal_summary <- data |>
  clean_names() |>
  select(where(is.factor), price) |>
  pivot_longer(cols = where(is.factor),
               names_to = "Variable",
               values_to = "Category") |>
  group_by(Variable, Category) |>
  summarise(
    Count = n(),
    Percent = round(100 * Count / nrow(data), 2),
    Mean_Price = round(mean(price, na.rm = TRUE), 0),
    .groups = "drop"
  ) |>
  group_by(Variable) |>
  slice_max(order_by = Count, n = top_n_per_var, with_ties = FALSE) |>
  ungroup()

kable(
  nominal_summary,
  caption = paste0(
    "Top ", top_n_per_var,
    " categories per factor variable (counts, share %, and mean price)"
  ),
  digits = 2,
  align = c("l", "l", "r", "r", "r")
)
Top categories for factor variables with counts, proportions, and mean price
Variable Category Count Percent Mean_Price
city Houston 19226 1.41 477651
city Tucson 7876 0.58 384816
city Phoenix 7694 0.57 543665
city Los Angeles 7556 0.56 1885626
city Dallas 7510 0.55 587276
city Philadelphia 7336 0.54 338467
city Richmond 6592 0.48 392538
city Orlando 6281 0.46 418841
city Fort Worth 6171 0.45 389780
city Saint Louis 5970 0.44 250029
state California 170954 12.57 1095518
state Texas 145394 10.69 451253
state Florida 127675 9.39 649826
state Arizona 54488 4.01 552916
state Pennsylvania 51922 3.82 343792
state New York 50935 3.75 669257
state Georgia 49234 3.62 422988
state Illinois 46901 3.45 357316
state Washington 46450 3.42 728113
state Virginia 44236 3.25 547994
status for_sale 750493 55.18 621383
status sold 609583 44.82 515063

Interpretation

city: Most listings in Houston, Tucson, Phoenix, and Los Angeles. Prices range widely — highest in Los Angeles (~$1.9M), lowest around $250k (Saint Louis).

state: California, Texas, and Florida dominate listings (>30% total). California shows the highest mean price (~$1.1M).

status: 55% for sale, 45% sold. Active listings are priced higher (~$621k vs. $515k).

Overall: Listings cluster in major U.S. cities and states, with strong regional price differences, especially high in California and large metro areas.

Visualisation of nominal variables (top categories)

Code
nominal_summary <- nominal_summary |>
  group_by(Variable) |>
  mutate(Category = forcats::fct_reorder(Category, Count),
         Category = factor(Category, levels = unique(Category))) |>
  ungroup()

# Plot: Facets untereinander, mit eigener x-Skala und y-Skala pro Variable
ggplot(nominal_summary, aes(x = Count, y = Category, fill = Variable)) +
  geom_col(show.legend = FALSE, alpha = 0.8, width = 0.7) +
  facet_wrap(~ Variable, ncol = 1, scales = "free", drop = TRUE) +
  scale_x_continuous(labels = label_comma()) +   #Tausendertrennung, keine 1e+05
  theme_minimal() +
  labs(
    title = "Top Categories per Factor Variable",
    x = "Count",
    y = "Category"
  ) +
  theme(
    panel.spacing.y = unit(1, "lines"),
    strip.text = element_text(size = 12, face = "bold"),
    axis.text.y = element_text(size = 8),
    plot.margin = margin(5, 15, 5, 5)
  )

Bivariate Analysis

Pairs Plot (all numeric variables)

Code
set.seed(123)

data_num <- data |>
  janitor::clean_names() |>
  select(where(is.numeric)) |>
  slice_sample(n = 3000) |>
  mutate(across(everything(), log1p)) 

p <- ggpairs(
  data_num,
  progress = FALSE,
  upper = list(continuous = wrap("cor", size = 4, alignPercent = 0.8, stars = TRUE)),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.7)),
  diag = list(continuous = wrap("densityDiag", alpha = 0.7))
)

p +
  theme_minimal(base_size = 11) +
  theme(
    strip.text = element_text(size = 8, face = "bold"),
    panel.grid = element_blank(),
    axis.text = element_text(size = 8),
    axis.title = element_text(size = 9),
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5)
  ) +
  labs(title = "Pairs Plot (log-transformiert, n=3000)")

Price vs. House Size by Status

Code
# Stichprobe ziehen für Performance
set.seed(123)
sample_data <- data %>% sample_n(50000)

plot_ly(
  sample_data,
  x = ~house_size,
  y = ~price,
  color = ~status,
  type = "scatter",
  mode = "markers",
  alpha = 0.6
) %>%
  plotly::layout(
    title = list(text = "Relationship Between House Size and Price by Status"),
    xaxis = list(title = "House Size (sqft)"),
    yaxis = list(title = "Price ($)", type = "log")
  )

Interpretation

Positive relationship: Larger houses generally have higher prices, though the relationship weakens for very large properties.

Status comparison: Both for_sale and sold homes follow similar trends, but for_sale listings appear higher in price, suggesting sellers may list above sale values.

High variation: At similar sizes, prices vary widely — showing the strong influence of location and other factors.

Outliers: A few extremely large or expensive properties stretch the scale upward.

Multiple Regression

Code
model <- lm(price ~ house_size + bath + bed + state + status, data = data)
tidy(model) |>
  arrange(p.value) |>
  mutate(
    estimate = round(estimate, 1),
    std.error = round(std.error, 1),
    statistic = round(statistic, 1),
    p.value = signif(p.value, 3)
  ) |>
  datatable(
    caption = "Regressionsergebnisse (interaktiv)",
    filter = "top",         
    options = list(
      pageLength = 10,      
      autoWidth = TRUE,
      responsive = TRUE
    )
  )
Code
tidy(model, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(term = reorder(term, estimate)) |>
  ggplot(aes(x = estimate, y = term, fill = estimate > 0)) +
  geom_col(show.legend = FALSE) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(
    title = "Greatest contributors to price",
    x = "regression coefficient",
    y = ""
  )

Average Property Price Map

Code
valid_states <- tibble(
  state_name = c(state.name, "District of Columbia"),
  state_abbr = c(state.abb,  "DC")
)
Code
map_price <- data |>
  group_by(state) |>
  summarise(avg_price = mean(price, na.rm = TRUE), .groups = "drop") |>
  inner_join(valid_states, by = c("state" = "state_name")) |>
  mutate(avg_price_k = avg_price / 1000)

plot_ly(
  map_price,
  type = "choropleth",
  locationmode = "USA-states",
  locations = ~state_abbr,
  z = ~avg_price_k,
  text = ~paste0(state, "<br>Avg Price: $", round(avg_price_k, 1), "K"),
  colorscale = list(c(0, 1), c("lightblue", "darkblue")),
  colorbar = list(title = "Avg Price ($K)")
) |>
  plotly::layout(
    title = list(text = "Average Property Price by U.S. State"),
    geo = list(scope = "usa", projection = list(type = "albers usa"))
  )

Interpretation

Regional variation: Western and coastal states show generally higher property prices, while central regions are lower.

Highest averages: States like California, New York, and Washington stand out with mean prices well above $1M.

Moderate prices: States such as Texas, Florida, and Arizona fall in the mid-range (~$400–650K).

Lower averages: Midwest and Southern states have more affordable properties on average.

Summary: Property values are heavily influenced by geography — with the highest prices concentrated along the coasts and major urban centers.

Average House Size Map

Code
map_size <- data %>%
  group_by(state) %>%
  summarise(avg_size = mean(house_size, na.rm = TRUE), .groups = "drop") %>%
  inner_join(valid_states, by = c("state" = "state_name"))

plot_ly(
  map_size,
  type = "choropleth",
  locationmode = "USA-states",
  locations = ~state_abbr,
  z = ~avg_size,
  text = ~paste0(state, "<br>Avg Size: ", round(avg_size), " sqft"),
  colorscale = list(c(0, 1), c("lightgreen", "darkgreen")),
  colorbar = list(title = "Avg Size (sqft)")
) %>%
  plotly::layout(
    title = list(text = "Average House Size by U.S. State"),
    geo = list(scope = "usa", projection = list(type = "albers usa"))
  )

Interpretation

General trend: Average house sizes are fairly consistent across most states, typically around 2,000–2,500 sqft.

Larger homes: Some central and mountain states (e.g., Colorado, Utah, Iowa) show slightly larger averages, possibly due to more available land.

Smaller homes: Coastal and densely populated states (e.g., New York, California) tend to have smaller average house sizes.

Price Range by US State

Code
map_extremes <- data |>
  group_by(state) |>
  summarise(
    min_price = suppressWarnings(min(price, na.rm = TRUE)),
    max_price = suppressWarnings(max(price, na.rm = TRUE)),
    .groups = "drop"
  ) |>
  mutate(range_price = max_price - min_price) |>
  inner_join(valid_states, by = c("state" = "state_name"))

plot_ly(
  map_extremes,
  type = "choropleth",
  locationmode = "USA-states",
  locations = ~state_abbr,
  z = ~range_price,
  text = ~paste0(
    state,
    "<br>Min: $", formatC(min_price, big.mark = ",", format = "f", digits = 0),
    "<br>Max: $", formatC(max_price, big.mark = ",", format = "f", digits = 0)
  ),
  colorscale = "Reds",
  colorbar = list(title = "Price Range ($)")
) |>
  plotly::layout(
    title = list(text = "Price Extrem Values by U.S. State (Max − Min)"),
    geo = list(scope = "usa", projection = list(type = "albers usa"))
  )

Interpretation

Highest ranges: California shows by far the largest price range (over $400M), driven by extremely high luxury property values.

Moderate ranges: States like Florida and parts of the Northeast also show wide price spreads, reflecting diverse markets from affordable to luxury homes.

Lower ranges: Most central and midwestern states have smaller price gaps, indicating more uniform housing markets.

Summary